Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
air <-read_csv("taiwan_air_quality_mod.csv")
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 5882208 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): sitename, county, pollutant, co, o3, o3_8hr, pm10, pm2.5, windspe...
dbl (11): ...1, aqi, so2, no2, nox, no, pm2.5_avg, so2_avg, longitude, lati...
lgl (1): unit
dttm (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
air
Background Research
Variable Abbreviations:
AQI: Air Quality Index, the higher the worse it is.
pollutant: The type of pollutant it is.
so2: Sulfur Dioxide in ppb
co: Carbon Monoxide in ppm
o3: Ozone in ppb
o3_8hr 8-hour average of Ozone
pm10: Particulate matter under 10μm
pm2.5: Particulate matter under 2.5μm
no2: Nitrogen Dioxide in ppb
nox: Nitrogen Oxides in ppb
no: Nitric Oxide in ppb
co_8hr: 8-hour average of CO
pm2.5_avg: Moving average of PM2.5
pm10_avg: Moving average of PM10
so2_avg: Moving average of SO2
PM2.5: Fine particulate matter with a diameter of less than 2.5 micrometers. These particles are capable of penetrating deep into the lungs and bloodstream, causing respiratory and cardiovascular issues.
PM10: Particulate matter with a diameter less than 10 micrometers. These particles can cause respiratory irritation but are not as invasive as PM2.5.
SO2: Sulfur Dioxide is a toxic gas produced primarily by fossil fuel combustion, which can lead to respiratory problems.
NOx: A group of nitrogen oxides (including NO2 and NO) generated by combustion engines, contributing to smog and acid rain.
CO: Carbon Monoxide is a colorless, odorless gas that can be harmful when inhaled in large amounts. It primarily originates from vehicle emissions and other combustion sources.
O3: Ozone is a gas that forms in the atmosphere and is harmful at ground level. It’s a major component of smog and can cause respiratory issues.
Source for this information: https://www.kaggle.com/datasets/taweilo/taiwan-air-quality-data-20162024
This data set is about Taiwan’s air quality, or AQI. Having a low or high AQI directly translates into better or worse health for the people living there; it is a crucial variable for determining how healthy a location is. I am going to ask 6 questions, all relating to Taiwan’s AQI. Hopefully, this “report” will help showcase the factors that determine air quality so we can understand what causes good or poor AQI scores.
Question 1. Which type of main pollutant is most prevalant in the counties, and what are their AQI’s?
First, we need to clean some of the data so that we can make a visualization.
To start, we have to get the date in a better format.
air |>mutate(Date =as.Date(date, format ="%d/%m/%y")) |>mutate(Year =format(Date, "%Y")) |>mutate(Month =format(Date, "%m")) |>mutate(Day =format(Date, "%d"))-> air.dateair.date
air.pollutant |>ggplot(aes(x = pollutant, y = count, fill = aqi.bar)) +geom_bar(stat ="identity") +scale_fill_gradient(low ="deepskyblue", high ="darkblue") +labs(title ="Number of Each Pollutant in the Summer of 2023",x ="Pollutant",y ="Count",fill ="Average AQI") +theme_minimal()
This graph shows the number of times a type of pollutant was the main pollutant in the summer of 2023. It also shows the average AQI through the color scale on the right side of the graph. The darker the color, the worse the AQI is when that pollutant is the most common. Interestingly, PM2.5 dominates as the most common pollutant by far. However, Ozone seems to cause the worst AQI, as you can see by the darkest color of the bar. Finally S02 is rarely the AQI, and is barely visible on the graph. Using this information, it would appear that Taiwan should primarirly focus on reducing the amount of PM2.5 and Ozone particles in order to improve their air quality.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `windspeed = as.double(windspeed)`.
Caused by warning:
! NAs introduced by coercion
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
air.time
air.time |>drop_na(aqi.bar, windspeed.bar) |>mutate(Month =factor(Month, levels =1:12)) |># Convert month to factorggplot(aes(Month, aqi.bar, group = Year)) +# Group by Year because each graph is one yeargeom_line(color ="purple") +theme_classic() +ggtitle("Taiwan's AQI Over Time") +labs(x ="Month", y ="AQI") +facet_wrap(~Year) +scale_x_discrete(labels = month.name) +# Replace factor levels with month namesscale_y_continuous(expand =c(0, NA)) +theme(axis.text.x =element_text(angle =45, hjust =1))
This graph shows the average AQI for each month over a time period of several years. Note, there is not much data in 2016, but there are still 2 months of data that can be shown. The peak of this data set appears to have happened in 2017, when the AQI was over 80. The lowest the AQI ever hits is below 30 in 2022. Overall, looking at all of the years in this data set, it does appear that the air quality is improving, albeit slowly. However, the difference is so small, that it may not be statistically significant. Most interestingly, June, July, and August always have the lowest AQI. These are the summer months in Taiwan, so I would guess that warmer temperatures might lead to lower AQI. Unfortunately, temperature is not a variable in this data set, so I have no direct way to verify this.
air.time |>drop_na(aqi.bar, windspeed.bar) |>mutate(Month =factor(Month, levels =1:12)) |># Convert month to factorggplot(aes(x = Month, y = aqi.bar, group = Year, color =as.factor(Year))) +# Group and color by Yeargeom_line() +theme_classic() +ggtitle("Taiwan's AQI Over Time") +labs(x ="Month", y ="AQI", color ="Year") +# Add legend for Yearscale_x_discrete(labels = month.name) +# Replace factor levels with month namesscale_y_continuous(expand =c(0, NA)) +theme(axis.text.x =element_text(angle =45, hjust =1))
This graph shows the same data as the previous graph, but in a different format. Here, it is much easier to see that all of the years have the same aqi trends. The summer months have better air quality than colder months do.
Let’s make one more graph using this data, this time a heat map.
air.time |>drop_na(aqi.bar, windspeed.bar) |>mutate(Month =factor(Month, levels =1:12), Year =factor(Year) ) |>ggplot(aes(x = Month, y = Year, fill = aqi.bar)) +geom_tile() +theme_classic() +ggtitle("Taiwan's AQI Over Time") +labs(x ="Month", y ="Year", fill ="AQI") +# Add legend for AQIscale_x_discrete(labels = month.name) +# Replace factor levels with month namesscale_fill_gradient(low ="blue", high ="orange") +# Color gradient for AQItheme(axis.text.x =element_text(angle =45, hjust =1)) # Rotate x-axis labels
Ok, this is just another cool way to showcase the same data, this time as a heat map. The center band is all blue. This is the maps’s representation of the dip in the aqi levels for the summer months. You can also see that the winter months here have worse AQI than any other. I am guessing that temperature might have to do with aqi, but that variable is not in the data set. Anyway, this might be the best way to look at this data as it is colorful, and easy to read, but still very useful!
3. Is there any relationship between N0X (Nitrogen Oxide) and CO (Carbon Monoxide)? If so, what does the linear model say about their relationship?
For this question, I am going to look at the past two years in the least analyzed counties.
lm(nox ~ co, data = air.nox) -> nox.modelData # order is y, then xnox.modelData
Call:
lm(formula = nox ~ co, data = air.nox)
Coefficients:
(Intercept) co
-2.055 41.064
Now I am going to make a model for each county.
air.nox |>filter(county =="Nantou County") -> air.nanlm(nox ~ co, data = air.nan) -> nan.modelData # order is y, then xnan.modelData
Call:
lm(formula = nox ~ co, data = air.nan)
Coefficients:
(Intercept) co
0.5758 32.8481
air.nox |>filter(county =="Changhua County") -> air.changlm(nox ~ co, data = air.chang) -> chang.modelData # order is y, then xchang.modelData
Call:
lm(formula = nox ~ co, data = air.chang)
Coefficients:
(Intercept) co
-0.07843 38.19917
air.nox |>filter(county =="Pingtung County") -> air.pinglm(nox ~ co, data = air.ping) -> ping.modelData # order is y, then xping.modelData
Call:
lm(formula = nox ~ co, data = air.ping)
Coefficients:
(Intercept) co
-4.147 44.999
air.nox |>filter(county =="Yunlin County") -> air.yunlm(nox ~ co, data = air.yun) -> yun.modelData # order is y, then xyun.modelData
Call:
lm(formula = nox ~ co, data = air.yun)
Coefficients:
(Intercept) co
-2.522 42.314
This figure shows the relationship between Carbon Monoxide and Nitrogen Oxide. Specifically, this graph is for December 2023, for 4 counties. All 4 counties have roughly the same outcome, a decently strong correlation between the two variables. Looking at the graph, you can see that as CO increases, so does NOX. This makes sense, as both pollutants come from combustion engines. This means that the more cars are driving around, the more NOX and CO there will be. Note, that this correlation does NOT mean that these two variables are causing each other. Rather, they are simply correlated and both tend to increase or decrease at the same time, probably depending on how many cars are driving on the road. Finally, I made a linear model for all 4 counties, and 4 models for each individual counties, using CO as a predictor variable for NOX. The y-intercept is -2.368, which is the model’s prediction of NOX if the amount of CO is 0. The slope of the line is 41.279. This refers to how steep the linear model is. More explicitly, for ever one unit of CO, NOX increases by about 41 units, according to the model. The other models slopes are about 41, 28, 43, and 39. Besides Nantou’s counties 28, all of the counties have roughly the same slope.
4. Does Northern or Southern Taiwan have worse AQI? What about East and West?
Let’s graph Taiwan!
First, let’s get the data that we want on the graph, the average AQI and the average windspeed for each county.
air.date |>filter(Year ==2023) |>filter(Month =="12") |>drop_na(aqi, county, windspeed) |>group_by(county) |>mutate(# Removing "City" and "County" suffixes to match List 2county =gsub(" (City|County)$", "", county),# Renaming county column to match List 2county =case_when( county =="Lienchiang"~"Lienkiang", # Rename 'Lienchiang' to 'Lienkiang' county =="Chiayi"~"Chiayi County", # Rename 'Chiayi' to 'Chiayi City' county =="Hsinchu"~"Hsinchu County", # Rename 'Hsinchu' to 'Hsinchu City' county =="Yunlin"~"Yulin", # Rename 'Yunlin' to 'Yulin'TRUE~ county # Keep the rest unchanged ) ) |>rename(NAME_2 = county) |>summarise(aqi.bar =mean(aqi), windspeed.bar =mean(windspeed)) -> air.map
Warning: There were 20 warnings in `summarise()`.
The first warning was:
ℹ In argument: `windspeed.bar = mean(windspeed)`.
ℹ In group 1: `NAME_2 = "Changhua"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 19 remaining warnings.
air.map
Let’s look at our file.
taiwan <-st_layers("gadm41_TWN.gpkg")taiwan
We need to compare the county names in our data set and the file we downloaded.
unique(air.map$NAME_2)
[1] "Changhua" "Chiayi County" "Hsinchu County" "Hualien"
[5] "Kaohsiung" "Keelung" "Kinmen" "Lienkiang"
[9] "Miaoli" "Nantou" "New Taipei" "Penghu"
[13] "Pingtung" "Taichung" "Tainan" "Taipei"
[17] "Taitung" "Taoyuan" "Yilan" "Yulin"
As you can see, the further south that you go, the worse the AQI seems to get, except along the East coast. After doing a small amount of research, Taiwan has a huge mountain range called Chungyang/Central Mountain Range that separates the East and West of this country. Because of this mountain range, the population is mainly on the West coast. This explains why the average AQI is so much higher on the West, there are more people living there.
5. In 2023, in New Taipei City, is there a correlation between the PM2.5 and AQI?
air.wind |>ggplot(aes(x = pm2.5, y = aqi)) +geom_point(color ="orange", size =3) +geom_smooth(method ="lm") +transition_time(as.numeric(Month)) +# Ensure Month is numericlabs(title ="Month: {month.name[frame_time]}", # Use month name as text x ="PM2.5", y ="AQI") +theme_minimal() -> air.animationanimate(air.animation, enderer =gifski_renderer())
`geom_smooth()` using formula = 'y ~ x'
6. Since we know the Southern area of Taiwan has worse AQI, is there any correlation between the latitude/longitude and windspeed?
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `windspeed = as.numeric(windspeed)`.
Caused by warning:
! NAs introduced by coercion
air.final
Now, let’s graph it!
ggplot(air.final, aes(x = latitude, y = windspeed.bar, group =1)) +geom_smooth(aes(latitude, windspeed.bar), method ="lm") +geom_point(aes(size = aqi.bar), alpha =0.5, color ="goldenrod1") +scale_size_continuous(name ="AQI Value", range =c(3, 15)) +labs(title ="Latitude vs. Windspeed in New Taipei City, 2023", x ="Latitude", y ="Average Windspeed") +theme_bw()
`geom_smooth()` using formula = 'y ~ x'
ggplot(air.final, aes(x = longitude, y = windspeed.bar, group =1)) +geom_smooth(aes(longitude, windspeed.bar), method ="lm") +geom_point(aes(size = aqi.bar), alpha =0.5, color ="goldenrod1") +scale_size_continuous(name ="AQI Value", range =c(3, 15)) +labs(title ="Longitude vs. Windspeed in New Taipei City, 2023", x ="Longitude", y ="Average Windspeed") +theme_bw()
`geom_smooth()` using formula = 'y ~ x'
These graphs shows how latitude, longitude, wind speed, and aqi are all related. Interestingly, it appears that there is a slight trend between latitude and wind speed. As latitude increases, so does the wind speed, which is a direct relationship. There is also a trend between longitude and wind speed. As longitude increases, the wind speed decreases. This is an inverse relationship. However, it appears that the aqi does not have much variance on either of the two graphs. So, according to these figures, longitude and latitude does not affect aqi, even though it does affect wind speed. However, according to external research, aqi is affected by wind speed, so I wonder why that is not the case for this specific data set.
#7. In general, how was Lienchiang’s air quality in the winter of 2023?
First, let’s clean the data.
air.date |>filter(county =="Lienchiang County"& Year ==2023) |>filter(Month %in%c("12", "01", "02", "03")) |>drop_na(Month, aqi) |>mutate(Month =case_when( Month =="01"~"January", Month =="02"~"February", Month =="03"~"March", Month =="12"~"December",TRUE~ Month # Keep as is for unexpected values ),aqiCat =factor(case_when( aqi <=50~"Good", aqi >50& aqi <=100~"Moderate", aqi >100& aqi <=150~"Unhealthy for Sensitive Groups", aqi >150& aqi <=200~"Unhealthy", aqi >200& aqi <=300~"Very Unhealthy", aqi >300~"Hazardous",TRUE~NA_character_# Default case for missing or unexpected values ), levels =c("Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous")) ) |>group_by(Month, aqiCat) |>summarise(count =n(), .groups ="drop") |>complete(Month, aqiCat, fill =list(count =0)) -> air.levelsair.levels
air.levels |>ggplot(aes(aqiCat, count, fill = aqiCat)) +# Map aqiCat to fillgeom_col() +facet_wrap(~Month) +ggtitle("Lienchiang's AQI in the winter of 2023") +labs(x ="Air Quality Categories", y ="Count") +scale_fill_manual(values =c("green", "yellow", "orange", "red", "purple", "maroon")) +# Use custom colorstheme(axis.text.x =element_text(angle =20, hjust =1),legend.position ="none"# Optionally hide the legend if redundant )
Here is the graph of Lienchiang’s aqi in the winter of 2023. I chose these colors because they match the official aqi level colors that I found online. Looking at this graph, one can see that the aqi levels are mostly in the moderate category. This is somewhat surprising, I expected these values to be much worse. One positive thing is that this winter never had any very unhealthy or hazardous readings. Also, almost every month in the figure has the same story where the moderate category dominates, while the good category is second, except for February. In that month, most of the aqi readings were actually good! Overall, some very interesting findings in this data set.